This lab will introduce you to metabarcoding. Investigating this “hidden diversity” often relies heavily on bioinformatics. Here, most of the analyses have already been carried out for you, and the focus is on interpreting the results of the analyses. All graphs in this practical have been generated in R. The code is visible for those interested (don’t worry if it does not make sense! The goal here is to interpret results) - but you can hide the code by clicking the Hide button above each code chunk. There are also a few dropdown boxes in the document tht contain extra information, an example is shown below:
To complete this lab, answer all questions (in bold) and discuss them with your teaching assistant at the end of the lab.
Good luck! :)
After completing this lab, you will be able to:
Microbes run the world! Protists and fungi play crucial ecological roles such as primary production, consumers and decomposers. But how do we study them given that they are so small? Furthermore, most of them are difficult to culture. One method is: using a metabarcoding approach!
Fig. 1: Metabarcoding overview (source:http://www.naturemetrics.co.uk)
Figure 1 shows an overview of the metabarcoding approach. The most commonly used DNA barcode for microbial eukaryotes is the 18S rDNA gene (though it should be noted that the internal transcribed spacer, ITS, is more commonly used for fungi). In this lab, we will focus on the 18S gene using subsets of published sequencing data from: global marinesampling expeditions such as Tara Oceans and the Malaspina Expedition, freshwater from Lake Baikal, and soils from Neotropical forests.
In this section, you will work with a subset of data from the Tara Oceans Expedition (de Vargas et al. 2015, Science). For this lab, we have gathered sequences from two marine environments: Coral Reefs, and Open Ocean. Each environment is represented by three stations each, meaning that you will be working with data from 6 stations in total. The location of these stations is shown in the map in Figure 2.
Fig. 2: Tara Ocean sampling stations (adapted with permission from Decelle et al. 2018, Cell). Here, we will focus on the circled stations.
The Tara Oceans project used a region (V9) of the 18S gene (the Small Ribosomal Subunit gene in eukaryotes, see Figure 3 below) as a metabarcode.
Fig. 3: A cartoon of the 18S gene. The V4 and V9 regions are the most barcodes for microbial eukaryotes.
These sequences were correspond to organisms living in surface water (~ 7 m depth) collected from the six stations. The water was size fractionated to recover biodiversity from major organismal size fractions: pico-nanoplankton (0.8 to 5 microns), nanoplankton (5 to 20 microns), microplankton (20 to 180 microns) and mesoplankton (180 to 2000 microns). However, in this lab we have merged all the size fractions for simplicity.
Normally, we start our data analysis with a LOT of reads or metabarcodes. The first step is to filter these reads based on quality and then cluster them into biologically meaningful Operational Taxonomic Units (OTUs). Here, in the interest of time, we have already cleaned the reads and clustered them into OTUs.
# read table containing metadata about stations
otus <- read.csv("data/metadata.txt", header=T, sep = "\t")
otus
## station no_of_OTUs no_of_reads type time.date Temp..Celcius.
## 1 station45 16376 9777277 open_ocean 2010-04-13T03:21 30.59345
## 2 station50 10532 2742747 open_ocean 2010-05-09T05:30 27.54930
## 3 station126 31777 4930749 open_ocean 2011-08-28T18:05 26.68965
## 4 station121 11639 2396972 coral 2011-07-09T17:47 24.23274
## 5 station46 15309 15360520 coral 2010-04-15T02:01 30.12475
## 6 station49 12782 1722440 coral 2010-04-23T10:10 28.27290
Questions:
1. The Tara Oceans Project found ~110,000 OTUs in total from surface waters of tropical and temperate oceans. Pick any one of the six stations. What proportion of total species diversity is present at that station?
OTUs are often defined using sequence similarity thresholds (traditionally sequences more than 97% dissimilar are treated as belonging to different species, though other criteria are also frequently used). But what do these OTUs actually represent? Does OTU = species? Here we will evaluate what OTUs represent using several different examples of described species.
You isolate two cells from soil samples in Nova Scotia, Canada and photograph them using contrast light microscopy (Figure 4).
Fig. 4: Two protist cells from soil, Canada (from Lax et al. 2018, Nature). Size of scale bar in a is 10 μm, and 5 μm in b.
Questions:
2. Based on morphology, do you think A and B are different species?
You then generate 18S sequences (provided below) for both cells to confirm their identity. You align them together to find how similar they are to each other. To do so, use the blast2seq website, and copy paste the sequences, one in each box. (NB: the following DNA sequences are in fasta format, a commonly used format for representing nucleotide and amino-acid sequences).
>seq_A GGTTCGATTCCGGAGAGGGAGCCTGAGAAATGGCTACCACATCTAAGGATGGCAGCAGGCGCGCAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATATCGATAGTCGCCTTTTTACAGGAGACTAATTGAAATGAGAACAATTTAAACCCCTTAACGAGGATCCATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCAGAATCGTGCGTGGCCTGTAACAGGGTCATTCAAAGATTCGGAGCTTGATTCGAATGAGTTGAGCTTTGGGTCACGATTCTTACCTATTATTCACGGCGGGGCAACTCACCGAGGATAATTCGTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCGTTCGCCATTCATTATGAATACATTAGCATGGAATAATAAAATAGGACTTTAGTTTTATTTTATTGGATCCTAGGACTAAAGTAATGATTAATAGGGACAGTTGGGGGTATTCGTATTCTTGAGTCAGAGGTGAAATTCTTAGATTTCAGGAAGACGAACTTCTGCGAAAGCATTTACCAAGGATGTTTTCATTAATCAAGAACGAAAGTTGGGGGATCGAAGACGATCAGATACCGTCGTAGTCTCAACCATAAACGATGCCGACTAGGGATCAGTGAATGTTTTATATTTGACTTCATTGGCACCTCCAGAGAAATCGCAAGTTTTTGGGTTCTGGGGGGAGTATGGTCGCAAGGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAGCCTGCGGCTTAATTTGACTCAACACGGGAAAACTTACCAGGTCCAGACATAGCAAGGATTGACAGATTGAAAGACCTTTCTTGATTCTATGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGTGATTTGTCTGCTTAATTGCGATAACGAACGAGACCTTAACCTACTAACTAGTCACACTAACCTTCCGAAATGGTGTTGGTGGCGGACTTCTTAGAGGGACAACGTGTGCTGAGCACGAGGAAGTTTGAGGCAATAACAGGTCTGTGATGCCCTTAGATGTTCTGGGCCGCACGCGCGCTACACTGATACAATCAATGAGTTTCTACAAGACCTTAACTGATAAGTTTGGGTAATCTTTTGAAATTGTATCGTGATGGGGATAGATCATTGCAACTATTGATCTTCAACGAGGAATTCCTAGTAAGCGTAAGTCATCAGCTTACGTTGATTACGTCCCTGCCCTTTGTACACACCGCCCGTCGCTCCTACCGATTGGATGATTCGGTGAACTTTTCGGACCGTGATCTATTGTTGGGAAACCAGCGTGGATCG
>seq_B GAATCAGGGTTCGATTCCGGAGAGGGAGCCTGAAGCTGTGAAACAACGGGCTTGATAGTGGCCTGGCAACAGGACGCTAGTTACGCTCTGACCGTAGCGACACCTTCAAATTGCTGGAAGTCCCTAAAGCTCTTAGTACCGCGGCGCCTGGCAACAGAGCGTGCAGCAGGTGGTGTAGTGGCCACTGGGTGGTAACAATCTAAGAGATGACACAATGGGTAACCAGCAGCCAAGTGGCCTCTGGCCATGCAGTTCACAGACTAAATGTTGGTGGGCGACCTCTGGGTTGCTTAAGATATAGTCGGCCGGCCGCAGAAATGCGGTTCTCTAAGAGGAATCTTCACTGCGAGTCAGAGAGCACGCAGCAGTGGAGAGGAGAGCTTAGAGACAGCCAGCACTGGACCAGTGCCTCGGCTGCGATCAGCGGAGAAATGGCTACCACATCCAAGGATGGCAGCAGGCGCGCAAATTACCCAATCCTGACACAGGGAGGTAGTGACAATAAATATCAATAGTGGACTTTTTACAGTACACTAATTGAAATGAGAACAATTTAAATCCCTTATCGAGGAACAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGGAGGTCAGGCGTCGTAGGGCAATAGTTCTGCGGTTGTCGGCTTCTCTTACCTGTCTGGCATTGCCCTCGTTGGTGGTGCTAGGCTCGTTTACTGTGAAAAAATTAGAGTGTTCAAAGCAGGCTTACGCTGTGAATACATTAGCATGGAATAATAGAATAGGACTTTGGCCCTATTTTGTTGGTTTCTAGGACCGAAGTAATGATTAATAGGGACAGTTGGGGGCATTCGTATTTTACAGTCAGAGGTGAAATTCTTAGATTTGTGAAAGACGAACTACTGCGAAAGCATTTGCCAAGGATGTTTTCATTAATCAAGAACGAAAGTTAGGGGATCGAAGACGATCAGATACCGTCGTAGTCTTAACCATAAACGATGCCAACTAGGGATGGGCAGATGTTAACTTTATGACTCTGTCCGCACCTCCAGAGAAATCGCAAGTTTT
Questions:
3. What is the percentage identity between the two sequences? Does this information change your answer to question 2? Why or why not?
Imagine that you collected several marine samples from different areas. You observe samples A and B under the microscope and take the following images (Figure 5).
Fig 5. Cells in samples A and B. Images re-used with permission from Tikhonenkov et al. 2019, PLOS One.
Questions:
4. Would you describe as the same species or different species based on the information you have?
5. Using the blast2seq server, what is the percentage identity between the two sequences (provided below)? Does it change your answer to the previous question?
>seq_A GGCTCCATATACTAGTGAGAACCTACCTACTCAGTCAACTACAAGGCTAACCTTGCCAACAGCAAGGCTAATACTTGACTAACCAGCTACCTTAGGGTAGCAGCGTAAGGTCTAGCTTGACAGCTACCTTACATAGTAAGCTACTAGCTTACTGGTTGACCTGTAAGGGCTAAGGGCTGCTACTAGTAGTAGCTCGACCTAAGGCTGAGTAGAGTGCTCTCTAAGCCATTAGCTGGTAGTAAGGGTAGCGTCCTTACTAGGCGACCATGGCTACGGGGAATCAGCGTTCGATTCCGGAGAGAAAGCCTGAGACACGGCTATCACATCTAAGGAAGGCAGCAGGCGCGCAAATTGCCCAATGCAGACTCTGCGAGGCAGTGAACAGCTATACTAGCCTACATACCTAGTATGTAGGGAGGCTAATGCCTTAGAATGAATGCTGCTTAGTAAACAGCATGATAAGCTAGTAGAGGATCAGTCTGGTGCCAGCACCCGCGGTAACTCCAGCTCTGCAAGCCTATGATAATACTGTTGTCTTTAAAACGTCCGTAGCTAGCTTAAGCTAGTTGACTGCTAGTCTGTAGTATAGGCTAGCTAGTATTGCCTTTAGCTAGGTAGTGCTAGCTAGACTAACTACCTAGGTGGCTAGCTATAGCTAAGGCTGTAGCTAGCTGTAGGCTAGTAGTCTACTAGTGTAGCGACATCGCTAGTAAACAGGCATGCTCAAGGTTAGCTAAAGGGGTTTCCCTATAGCTAGAACTACATAGAGCGACACGTCAATATAGCTGGCTACTAGCTTGTCTAGTAGCCGGCGTGTACAATGAAGGGGTCTCAGGCTAGGAGTATTAGTAGGCTAGGGGTGAAATCCAGTGACCCTACTAAGACTACATGAGGCGAAAGCGCCTAGCTAGAGCCTACTTGTCGGCGATGGACGAGAGTGAAGGGCGCGAAGATGATTAGATACCGTTGTAGTCTTCACTGTAAACTATGCCAGCTTGGCTAGGTAGCTTCACCTAGTTGAAGGCTACCTAGCCATAGGGAAACTAGTAAGCTCTTCGGCTAGAGGGGTAGTACTGTCGCAAGGCAGAAACTTAAAGGAATTGACGGAAAAGCACCACCAGGCGTGGAGTCTGCGGCTTAATTCGACTCAACACGGGGAACCTTACCAGGGCAGGACACTAAGTGGATTGTCAGCCTACAGGGCTTACATGATATTGTGGAAAGTGGTGCATGGCCATTCTTAGTTCGTGAGGCGACTTGTCTCCTTAATTGAGGTAACGAGCGAGACCCTGAAGGCTAGTTGAGCCTAGCTAGCTAGTCTAGCTAGTGTGCTACCTAGTAGCCGCTCCTAGCCTGATTACTAGCGCCTAGCTAGTCGAGACAGGACAATAATAGGTCTGTTATGCCCTTAGATGGCCTGGGACGCACGCGTACTACACTGACACTCGCAGCCAGTATAGTAGCTACTAGCTATCTAGCTAGTAGCTAAGCTATGCCGTAAGGCATTGCGAAGCTGCAAAGGGTGTCTAGGTAAGGATTGCTGCCTGCAAGGGCAGCATCAATGAGGAATGCCTTGTAAGTGCCTTTCAGCATAAGGCTCTGAATACGTCCCTGCTTTTTGTACACACCGCCCGTCGCATCAAGGAACTCAACTAGGCCAATGAGCCTGCTGGACGCAAGGAAAGCAGGCAAA
>seq_B AACCTGGTTGATCCTGCCAGTACCATAGGCTAGTCTTGAAGACTAAGCCATGCATGTGTCAGTGCACAGCCTAGTTTACTAGGCATTCGGAAACTGCGGATGGCTCCATATACTAGTGAGAACCTGCCTGCCCAGTCAACTACAAGGATAACCTTGCCAACAGCAAGGCTAATACTTGATAAACAGCTAGCTTCGGCTAGTAGCGCAAGGCTAAGCTTGACGGCTACCTTGCATAGTAGGCTTCGGCTTACTGGTTGACCTGAAAGGATTAAGAGCTGCTACTAGTAGTAGCTCGACCTAAGGCTGAGCAGAGAGCTCTCTAAGCCATTAGCTGGTTGTGGGGGTAGCGGCTTCACAAGGCGACTATGGCTACGGGGAATCAGCGTTCGATTCCGGAGAGAAAGCCTGAGACACGGCTATCACATCTAAGGAAGGCAGCAGGCGCGCAAATTGCCCAATGCAGATTCTGCGAGGCAGTGAACAGCCATACCGGCCTGCGCACCTAGTGCGCAGGGAGGCTTTAAGCCTTAGAATGAATGCTGCTTAACAAACAGCATGATAAGCTAGTAGAGGATCAGTCTGGTGTTGTCTTTAAACGTCCGTAGCCAGCTAAAGCTGCTAAGCTACTAGTCTGCAGCACAGGCTAGCTTCAACTACCTTTCGCTAGGTAGTAGTTGCTAGACTGGCTGCCTTGGCGGAAGCCTTTAGCATAAGCTGAAGGCTACTGTAGGCTGGTAGTTTAGTAGCGTAGCGACATCGCTAGTAAACAGGCATGCTCAAGGTAAGCTAAAGGGTTTTCCCTATGGCTTGAACTATATAGAGCGACACGTCAGTCAAGCAGGCTACTTCGGTAGCCGGCGTGAACAATGAGGGGGACTCAGGCCAGGAGTATTGGCAGGCTAGGGGTGAAATCCAGTGACCCTGCCAAGACTGCATGAGGCGAAAGCGCCTGGCTAGAGCCTACTTGTCGGCGATGGACGAGAGTGAAGGGCGCGAAGATGATTAGATACCGTTGTAGTCTTCACTGTAAACTATGCCAGCTTGGCTGGGCAGCTTCACCTAGTTGAAGGCTGCCTGGCCAAAGGGAAACTAGTAAGCTCTTCGGCTAGAGGGGTAGTACTGTCGCAAGGCAGAAACTTAAAGGAATTGACGGAAAAGCACCACCAGGCGTGGAGTCTGCGGCTTAATTCGACTCAACACGGGAAACCTTACCAGGGCAGGACACTAAGTGGATTGTCAGCCTACAGGGCTTACATGATATTGTGGAAAGTGGTGCATGGCCATTCTTAGTTCGTGAGGCGACTTGTCTCCTTAATTGAGGTAACGAGCGAGACCCTGAAGGTTAGTTGAGCTTGGCTAGTTTACTAGTCAATGTGCTGCTTCGCAGCCGCTCCTAACCTGATTACTGGCGCCTAGCCAGTCGAGACAGGACAATAATAGGTCTGTTATGCCCTTAGATGGCCTGGGACGCACGCGTACTACACTGACGCTCGCAGCCAGTAGAGTAGCGGCTAGTTTACTAGCCGCTAAGCTATGCTGGAAAGCATTGCGAAGCTGCAAAGGGCGTCTCGGTAAGGATTGCTGCCTGCAAGGGCAGCATAAATGAGGAATGCCTTGTAAGCGCCCTTCAGCATAGGGCACTGAATACGTCCCTGCTTTTTGTACACACCGCCCGTCGCATCAAGGAACTCAACTGAGGCAATGAGCCTGCCGGACGCAAGGAAGGCAGGCGAATAGCCTTAGTCTTACCATGATGAAGTCGTAACAAGGTATCCGTAGGTGAACCTGCAGAAGGATCA
Finally, imagine that you obtained planktonic samples from Lake Erken and from the Baltic. You isolate some similar looking cells and examine them with an electron microscope. In Figure 6, cell A is from Lake Erken while cell B is from the Baltic.
Fig 6. Images reused from Logares et al. 2007, Microbial Ecology
Questions:
6. Based on morphology and ecology, are these protists different species?
7. Compare the 18S sequences of Cell A (Accession number: AY970662) and Cell B (Accession number: AY970653) using Blast2seq. What is the percentage identity? What could explain these results?
Based on these three scenarios, what can we say about what an OTU/18S sequence represents? The answer is that it depends. There are cases where morphology and DNA sequences are in agreement (such as in Scenario 1). However, there are also cases where different species can have identical 18S sequences (such as in Scenario 3). There are also cases where the opposite can happen - i.e. 18S sequences from the same species (or even the same cell!) are very different from each other. Despite these limitations, metabarcoding is an incredibly important method for documenting the organisms present in a sample in a high-throughput way. For instance, it can often be difficult to identify organisms based on morphology (such as in Scenario 2), and by sequencing the 18S gene, we can more easily place organisms in their respective clades. Next, we will play with metabarcoding data from the six Tara Oceans datasets and investigate the microbial eukaryotic diversity in marine habitats.
We will start with identifying all of the OTUs in our two marine environments (coral reef and open ocean), and assigning them taxonomy. This is done by searching the OTUs against a reference database. BLAST the sequences below against the GenBank database.
>otu_1485;613
gtcgctactaccgattgaacgttttagtgagacatttggactgggtcagtgtaggctttcatgcct acgttgtctccggaaagactttcaaacttgagcgtttagaggaagtaaaagtcgtaacaaggtttcc
>otu_58431;9
gtcgcacctaccgattggatgtttcgatgaggccctcggaccgtggcacgtcaccttgactggcaa cgcgctttgggaagttgtccaaatctcaacatctagaggaaggtgaagtcgtaacaaggtttcc
>otu_5604;24
gtcgctcctaccgattgagtgatccggtgaattacttggattgcagcagggctcagtgtgtgaact ttgctggagaaatgtcatgaaccttattacttagaggaaggagaagtcgtaacaaggtttct
Questions:
8. How long are the nucleotide sequences (metabarcodes) on average?
9. Considering only the top hit, identify the three OTUs to species or genus level. Which supergroup/major lineage do they belong to? Search them on Google images to bond with them a little.
10. What role do these three organisms carry out in the oceans? (Briefly read on Wikipedia/somewhere else).
11. In our fasta files, the fasta header describes two things, the unique OTU identifier (before the “;”), and the number of reads in that OTU (number after the “;”). Given that information, which organism is the abundant out of the three?
To identify all of the thousands of OTUs in our datasets, we will use the same approach as above but using automated methods. Here, instead of searching against GenBank which contains a lot of mislabelled sequences, we will use a curated database called PR2. This is a curated reference dataset specfically for the eukaryotic 18S gene. As in de Vargas et al. 2015 (the Tara Oceans paper), we will assign OTUs to a taxonomic lineage if they are at least 80% similar to a reference sequence in PR2. If they are less than 80% similar, they are considered “unassigned”.
Let’s have a quick look at the reference database, PR2, modified to include only the V9 region for the Tara Oceans data. As you can see, the fasta headers contain the taxonomy of the respective sequences.
head data/Database_W2_v9_pr2.fasta
## >AF272045 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|36|Kareniaceae_12|Karlodinium_05|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AF272049 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|39|Kareniaceae_26|Karlodinium_10|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AM494500 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|38|Kareniaceae_16|Karlodinium_07|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AF272046 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|35|Kareniaceae_04|Karlodinium_03|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
## >AF272050 Eukaryota|Alveolata|Dinophyta|Dinophyceae|Gymnodiniales|48|Kareniaceae_39|Karlodinium_16|micrum
## gtcgctcctaccgattgagtgatccggtgaataattcggactgcagcagtgtttagttcctgaacgttgcagcggaaagtttagtgaaccttatcacttagaggaaggagaagtcgtaacaaggtttcc
For simplicity, we have already searched our OTUs against PR2. Now let’s have a broad overview of the organisms are present in our samples! In the plots below, coral reef stations will be coloured in shades of pink/coral and open ocean stations in shades of blue.
What organisms are present in our marine samples?
# load necessary packages
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# read in data
stations <- read.csv("data/tara.otus.taxonomy.tsv", header=F, sep = "\t")
colnames(stations) <- c("group", "no_of_OTUs", "no_of_reads", "station")
# normalize data by station
norm_stations <- stations %>%
group_by(station) %>%
mutate(perc_OTUs = (no_of_OTUs / sum(no_of_OTUs))*100) %>%
mutate(perc_reads = (no_of_reads / sum(no_of_reads))*100)
# make custom theme for graphs
cust_theme <- theme_dark() +
theme(axis.text.y = element_text(angle=90, hjust=0.5), axis.text.x = element_text(angle=90, vjust=0.5, hjust=1), strip.text.x = element_text(angle=180),
panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(),
strip.placement = "outside",
panel.background = element_rect(fill = "white"),
strip.background =element_blank(),
panel.spacing = unit(0, "lines"),
axis.ticks = element_blank())
# order stations so they appear in the desired order
norm_stations$station <- factor(norm_stations$station, c("station46", "station49", "station121", "station45", "station50", "station126"))
# plot richness/diversity
ggplot(norm_stations, aes(x = group, y = perc_OTUs)) +
geom_bar(aes(fill = station), color="white", stat = "identity", position = "dodge") +
scale_fill_manual(values = c("station121" = "#F8766D", "station126" = "#00BFC4", "station45" = "#0B8CDB", "station46" = "#E07B58", "station49" = "#E05893", "station50" = "#0BDBA8")) +
xlab("") + ylab("% of OTUs") +
ggtitle("Diversity") +
cust_theme
# plot abundance
ggplot(norm_stations, aes(x = group, y = perc_reads)) +
geom_bar(aes(fill = station), color="white", stat = "identity", position = "dodge") +
scale_fill_manual(values = c("station121" = "#F8766D", "station126" = "#00BFC4", "station45" = "#0B8CDB", "station46" = "#E07B58", "station49" = "#E05893", "station50" = "#0BDBA8")) +
xlab("") + ylab("% of reads") +
ggtitle("Abundance") +
cust_theme
Questions:
12. The plots generated show relative species diversity and abundance. Why is it better to show relative values than absolute values?
13. A large proportion of OTUs could not be classified (“Unassigned”). What could they represent?
14. What is the most diverse lineage of organisms in each environment? Is it heterotrophic or photosynthetic?
15. What is the most abundant lineage of organisms in each environment? Is it heterotrophic or photosynthetic?
16. When estimating abundance of organisms, we assume that the number of reads is proportional to the number of respective organism; i.e. we assume that the species with the largest number of reads is the most abundant. Can you think of other factors that may affect the number of reads?
Now let us take a closer look at the most dominant major lineages in the global ocean (i.e. data from all Tara Ocean stations; data obtained from the Tara Oceans Companion Website, Dataset W6).
In the treemaps generated below, each group is represented by a rectangle, and the area of each rectangle is proportional to its value. Hover and click for interactivity on the interactive treemap. Hovering over a group will also give its value :)
Most diverse lineages
# load necessary packages
library(treemap)
library(d3treeR)
# dataset
data <- read.csv("data/w6.all-tara.tsv", header=T, sep = "\t")
# basic treemap
p_diversity <- treemap(data,
index=c("Supergroup","Major_lineage"),
vSize="No_of_OTUs",
type="index",
palette = "Set2",
bg.labels=c("white"),
align.labels=list(
c("center", "center"),
c("right", "bottom")
)
)
# make it interactive ("rootname" becomes the title of the plot):
inter_diversity <- d3tree2( p_diversity , rootname = "Hover and click groups for interactivity" )
inter_diversity
Questions:
17. List the three most diverse lineages in the global oceans (excluding Metazoa) as well as their ecological role in the oceans.
Most abundant lineages
# basic treemap
p_abund <- treemap(data,
index=c("Supergroup","Major_lineage"),
vSize="No_of_reads",
type="index",
palette = "Set2",
bg.labels=c("white"),
align.labels=list(
c("center", "center"),
c("right", "bottom")
)
)
# make it interactive ("rootname" becomes the title of the plot):
inter_abund <- d3tree2( p_abund , rootname = "Hover and click groups for interactivity" )
inter_abund
Questions:
18. List the three most abundant lineages in the global oceans (excluding Metazoa) as well as their ecological role in the oceans.
Before metabarcoding studies, our view of microbial diversity in the oceans was largely informed by describing morphological species that were cultured. The Tara Oceans expedition in particular helped change our view of planktonic diversity and discovered lots of novel diversity.
The plot below compares the number of reference sequences in PR2 and total Tara-Oceans V9 OTUs for a subset of groups. Note that reference sequences in PR2 are not derived from marine environments, but also include soils and freshwater organisms.
# Morphological species vs. Tara OTUs
data <- read.csv("data/pr2_otus.tsv", header=T, sep = "\t")
ggplot(data, aes(fill=Source, y=Number, x=Group)) +
geom_bar(position="dodge", stat="identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Questions:
19. Why are the groups Myxogastrea and Streptophyta found relatively less in Tara Oceans data?.
20. In which groups was the most hidden diversity discovered? List three groups. Roughly how many folds was the diversity increase?
Finally, let’s investigate how similar Tara Oceans OTUs from the six chosen stations are to reference sequences in PR2.
data <- read.csv("data/similarity.tsv", header=T, sep = "\t")
ggplot(data, aes(x=perc_identity, fill=perc_identity)) +
geom_histogram(binwidth = 1, boundary = 0, closed = "left", color="white", fill=rgb(247,128,0, maxColorValue = 255)) +
theme_light() +
xlim(50,100) +
xlab("Similarity to reference sequence (%)") +
ylab("Number of OTUs")
echo 'The total number of OTUs at these six stations is:'
cat data/similarity.tsv | wc -l
echo 'The number of OTUs between 80-90% similar to the reference database is:'
cat data/similarity.tsv | awk '$2 < 90' | awk '$2 > 80' | wc -l
echo 'The number of OTUs less than 80% similar to the reference database is:'
cat data/similarity.tsv | awk '$2 < 80' | wc -l
## The total number of OTUs at these six stations is:
## 57740
## The number of OTUs between 80-90% similar to the reference database is:
## 12311
## The number of OTUs less than 80% similar to the reference database is:
## 9631
Questions:
21. Sequences divergent from references represent novel diversity! Indeed, sequences that are less than 80% similar to any reference, cannot even be assigned to any major lineage or supergroup. What percentage of sequences are between 80-90% similar to references, and what percentage are less than 80% similar?
These results are exciting because they indicate how much we still have to learn about the organisms that keep our oceanic ecosystems functioning!
Metabarcoding data can also be used to investigate the biodiversity and ecology of taxa of interest.
In this part of the lab, we will focus on one particular genus, Symbiodinium, and investigate its distribution across the six stations (coral and open ocean). Symbiodinium is a genus of photosynthetic dinoflagellates. These micro-algae, colloquially called “zooxanthellae” are most well-known for being symbionts and the primary producers in reef-building corals. They are thus ecologically and economically important. There has been much research carried out on Symbiodinium biology and ecology in recent yeras in order to understand global coral reef decline.
Questions:
22. Do you expect to find Symbiodinium in one or both environment types? Explain your reasoning.
Consider the following OTUs from station 49 (coral reef) which have been classified as Symbiodinium.
# station 49
cat data/vsearch_49.tsv | grep 'Symbiodinium' | tr '|' '\t' | cut -f 1,2,10,11
## otu_3007;366 100.0 SymbiodiniumCladeA sp.
## otu_3567;1 98.4 SymbiodiniumCladeD sp.
## otu_432448;1 97.7 SymbiodiniumCladeA sp.
## otu_5241;692 100.0 SymbiodiniumCladeC sp.
## otu_162983;2 100.0 SymbiodiniumCladeC sp.
## otu_13108;6 80.8 SymbiodiniumCladeD sp.
## otu_539869;1 89.9 SymbiodiniumCladeE californium
## otu_288404;6 96.9 SymbiodiniumCladeD sp.
## otu_107980;1 98.4 SymbiodiniumCladeA sp.
Column 1 represents the OTU, Column 2 represents % similarity to closest reference, Columns 3 and 4 represent genus and species respectively.
Questions:
23. How confident can you be that otu_3007;366 is a Symbiodinium? Why?
24. How confident can you be that otu_539869;1 is a Symbiodinium? Why?
To resolve this, we will use the Evolutionary Placement Algorithm to “place” sequences in a reference tree. This is a very useful technique for classifying millions of short environmental reads or OTUs. Figure 7 shows an overview of the process.
station 49, coral
Questions:
25. How many OTUs are actually Symbiodinium? (These are OTUs that fall in the Symbiodinium clade.)
26. Are the results consistent with your answers to questions 23 and 24?
Similar analyses have been carried out for you for the other stations. View the trees here below.
station 46, coral
station 121, coral
station 45, open ocean
station 50, open ocean
station 126, open ocean
Questions:
26. Are the results consistent with your answer to question 22?
Thus far, we have looked at presence/absence of Symbiodinium in coral reef waters and open ocean waters. We can also investigate if there is any difference in abundance (i.e. number of Symbiodinium reads from each of the stations).
# read data
sym_abundance <- read.csv("data/sym_abundance.txt", header = T, sep = "\t")
# order stations in the order we wangt to appear in the graph
sym_abundance$stations <- factor(sym_abundance$stations, c("station_45", "station_46", "station_50", "station_49", "station_121", "station_126"))
# custom theme
require(ggplot2)
cust_theme <- theme_dark() +
theme(axis.text.y = element_text(angle=90, hjust=0.5), axis.text.x = element_text(angle=90, vjust=0.5, hjust=1), strip.text.x = element_text(angle=180),
panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(),
strip.placement = "outside",
panel.background = element_rect(fill = "white"),
strip.background =element_blank(),
panel.spacing = unit(0, "lines"),
axis.ticks = element_blank())
# plot
ggplot(sym_abundance, aes(x = stations, y = rel_abundance)) +
geom_bar(aes(fill = env_type), color="white", stat = "identity", position = "dodge") +
xlab("") + ylab("Symbiodinium abundance (% of reads)") +
ggtitle("Symbiodinium abundance") +
cust_theme
Questions:
27. Is there a trend in abundance of Symbiodinium?
28. Based on these results, what can we say about the lifestyle of Symbiodinium? Does it exclusively live through symbiosis with corals?
29. Does Symdiodinium have a free-living stage, or does it form symbioses with other organisms? We can test this hypothesis by looking at the presense and abundance of Symbiodinium in different size fractions on an online server. (Follow the instructions below and answer this question. Recall that Symbiodinium cells are usually between 6-13 microns in diameter.).
The Ocean Barcode Atlas is an online web service for rapidly analysing the biogeography of taxa using the entire Tara Oceans dataset. Go to the website and click on Community ecological analysis. In the taxonomy field, type in “Symbiodiniaceae”, untick computation of beta diversity, set maps and bubble plots to 1, and submit your job.
Select the following size fractions: 0.8-5 microns, 5-20 microns, 20-180 microns, 180-2000 microns. What does this tell us about the life cycle of Symbiodinium?
Chytridiomycota (informally chytrids) are fungi with flagellated zoospores which have recently been uncovered as important parasites or saprotrophs of diatoms in marine systems (Figure 8).
Fig 8: Scanning electron microscographs of pennate diatoms infected by chytrid-like fungi which are coloured in red (image from Kilias et al. 2020. Communications Biology)
Question:
30. Using the Ocean Barcode Atlas again, can you determine which temperature range do Chytridiomycota prefer? Are they more abundant in tropical waters?
What factors explain the microbial community compositions in different environmental samples? We will now return to our six Tara Ocean stations (coral and non-coral), and quantify the differences in microbial communities between the stations, and cluster them based on their differences. The measure of dissimilarity we will use is Bray-Curtis, which is commonly used in ecology. You do not need to know how it works, however if you are interested in a short, easy to read summary, see here
Importantly, the measure is between 0 and 1. Identical communities would have a Bray-Curtis dissimilarity of 0, while communities which share no species would have a Bray-Curtis dissimilarity of 1.
# load required package
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
# read data
matrix <- read.csv("data/chosen_stations_filtered_otus.tsv", header=T, sep = "\t", row.names = 1)
# view the data we are working with. The table contains read counts for each OTU in every station. Each row = OTU, each column = station.
head(matrix)
## station46 station49 station121 station45 station50 station126
## otu_1 160143 191068 63806 165918 48066 29592
## otu_2 505 651 322 1079 446 690
## otu_3 14895 10798 84622 37216 63754 51660
## otu_4 172454 20426 38846 188886 31332 467641
## otu_5 36236 12180 126136 70310 37317 213247
## otu_6 114638 64892 12028 187475 116723 7397
# transpose data so that rows and columns are switched
matrix_t <- t(matrix)
# normalise data
matrix_t_norm <- decostand(matrix_t, method = "total")
# calculate Bray-Curtis distance and cluster the stations
matrix.bc.dist <- vegdist(matrix_t_norm, method = "bray")
matrix.bc.clust <- hclust(matrix.bc.dist, method = "average")
plot(matrix.bc.clust, ylab = "Bray-Curtis dissimilarity")
Questions:
31. What explains the clustering that we see?
Finally, we are going to zoom out and compare the broad microbial communities from ocean, lake, and soil samples. To get an overview of the communities, we will phylogenetically place the metabarcoding sequences onto a reference phylogeny as we did for Symbiodinium previously. This time however, we will place OTUs on a global eukaryotic phylogeny with 155 annotated, reference 18S sequences (these sequences are derived from the PR2 18S dtaabase) (Figure 8). Note that because this tree is based on 18S sequences only, it is not entirely consistent with what we know from phylogenomics. However, this does not impact phylogenetic placement.
Fig 8: Reference 18S phylogeny of eukaryotes with major groups annotated.
The three datasets we will be comparing today are the Malaspina dataset (a global marine expedition like the Tara Oceans), a freshwater dataset from Lake Baikal, and a dataset from Neotropical forest soils. We’re opting for the Malapspina dataset instead of the Tara because like the soil and freshwater data, it covers the longer, V4 region instead of the V9 region.
Let’s have a cursory look at how many OTUs are in each dataset (NB: normally, we would normalize data first before comparisons, but for the purpose of this lab, it is okay to proceed with un-normalised data.)
Marine - Malaspina
# marine
cat data/malaspina.fasta | grep -c '>'
## 8217
Neotropical forest soils
# soils
cat data/neotropical-soils.fasta | grep -c '>'
## 28474
Freshwater - lake Baikal
# freshwater
cat data/baikal.fasta | grep -c '>'
## 1125
The phylogenetic placement has already been carried out. Download the placement files from GitHub. To do so, right-click each file and click “save link as”.
Now for the fun part! We will use the interactive Tree of Life website to visualize the difference between these three environments. Simply load each .jplace file (using the button “Choose file”, selecting the file, and click “Upload”). I recommend using three tabs so you can view all files at the same time.
You will now see a phylogenetic tree and a control panel. Select the following options in the control panel:
You will now see bubbles on the tree branches. Each bubble represents OTU placements on a particular branch. The bigger the bubble, the more OTUs belong to that particular lineage!
Questions:
32. What are the most dominant lineages in soils and freshwater samples?
33. In which environment are fungi the most dominant?
34. Which lineages are dominant in the oceans, but absent in freshwater and soils?
35. Are there lineages present in freshwater/soils but absent in oceans? Which ones?
36. List any other observations you find interesting!
Discuss your answers with your TA :)